This report explores a dataset of about 5,000 white wines (4,898 instances), each described by 11 input attributes and 1 output attribute. We examine whether any of these attributes are related to wine quality, which is scored by human tasters, and if so, how.
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median : 5.200 Median :0.04300 Median : 34.00 Median :134.0
## Mean : 6.391 Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :65.800 Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Our dataset consists of 13 variables and 4898 observations.
Except for the first variable, `X`, which is the index of the records, we have 12 attributes of white wines for analysis.
`quality` is the output attribute of the dataset. It can range from 0 to 10, but in this dataset only scores from 3 to 9 occur; most fall between 5 and 7, and other scores are rare. It is roughly normally distributed.
`quality` is actually an ordinal discrete variable, so I transformed it to a factor to make later analysis easier.
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
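The conversion to a factor can be sketched as follows. This is a minimal illustration, not the report's original code: it rebuilds the `quality` column from the counts printed above (so row order is lost), and the names `quality_num` and `quality_fct` are placeholders introduced here.

```r
# Rebuild the quality scores from the frequency table printed above.
# Only the distribution is recovered, not the original row order.
scores <- 3:9
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
quality_num <- rep(scores, times = counts)

# An ordered factor keeps the natural ranking 3 < 4 < ... < 9,
# which suits an ordinal grade better than a plain factor.
quality_fct <- factor(quality_num, levels = scores, ordered = TRUE)

print(table(quality_fct))
print(round(mean(quality_num), 3))
```

The mean of the rebuilt scores agrees with the 5.878 reported by `summary()`, which is a quick check that the counts were copied correctly.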
Most fixed acidity values lie between 4 and 10, and the distribution is roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Right skewed, with most values lower than 0.5 and a peak around 0.25.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Right skewed with a long tail and a strange peak around 0.49. I zoomed in on that area to check it, but the cause of the peak is still unclear.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Highly right skewed, with the peak below 3 and just a few outliers. The number of records decreases as residual sugar grows.
But when we transform the x-axis to a log scale, a bimodal shape appears, with two peaks around 2 and 9 and a valley around 3. We'll analyze later what causes that shape.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
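A log-scale view like the one described can be approximated by binning `log10(residual.sugar)`. The sketch below is only illustrative: it uses the first ten values shown in the `str()` output, which are far too few to reproduce the bimodal shape, and the bin width of 0.25 is an arbitrary choice made here.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Bin on a log10 scale. The breaks span the full range reported by
# summary() (0.6 to 65.8, roughly -0.22 to 1.82 in log10 units).
log_sugar <- log10(residual_sugar)
h <- hist(log_sugar, breaks = seq(-0.25, 2, by = 0.25), plot = FALSE)

# One row per bin: left edge (in log10 units) and number of wines.
print(data.frame(bin_start = head(h$breaks, -1), count = h$counts))
```

With the full column in place of `residual_sugar`, dropping `plot = FALSE` would draw the histogram directly.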
Right skewed, with most values lower than 0.1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Both distributions are right skewed with similar shapes. Since `free sulfur dioxide` is part of `total sulfur dioxide`, we're interested in the ratio between them.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
We divide `free sulfur dioxide` by `total sulfur dioxide` to get a new variable, `free sulfur dioxide rate`.
The histogram shows it is slightly right skewed with a few outliers; a ratio of about 30% is the most common.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0200 0.1900 0.2500 0.2556 0.3200 0.7100
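Because the summary above ranges from 0.02 to 0.71, the rate must be the free amount as a share of the total. A minimal sketch of that calculation, using the first ten rows from the `str()` output (the vector names are placeholders chosen here):

```r
# First ten rows of the two sulfur dioxide columns, from str() above.
free_so2  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total_so2 <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Share of total sulfur dioxide that is free; always within (0, 1]
# because the free portion cannot exceed the total.
free_so2_rate <- free_so2 / total_so2
stopifnot(all(free_so2_rate > 0 & free_so2_rate <= 1))

print(round(free_so2_rate, 2))
```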
Right skewed with a very long tail. This is reasonable: since wines are mostly water, their density should be very close to 1 g/cm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The distribution of values is symmetric around 3.15.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Right skewed, with a peak around 0.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Right skewed. We can roughly recognize three parts; a log scale didn't reveal anything more interesting.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
This dataset includes 4,898 instances of white wine. The 11 input attributes are continuous numerical variables, and the single output attribute, `quality`, is a discrete score from 0 to 10. Most attributes are skewed, especially to the right.
I'm curious how quality, as scored by human tasters, is affected by each physical attribute, if at all.
There should also be correlations among some of the attributes themselves; I'll try to figure them out.
Yes, I created a new variable representing the ratio of free to total sulfur dioxide.
When I applied a log scale to residual sugar, a bimodal shape appeared! There is also an unusual peak in citric acid. Both interest me, and I wonder about the cause. Yes, I transformed the `quality` variable from numerical to factor, since it is actually a discrete ordinal grading.
The matrix provides an overview of the correlations between each pair of variables. Concerning `quality`, the only linearly correlated variable (r > 0.3) is `alcohol`. The matrix also shows linear correlations (apart from several artifacts) between `density` and `residual.sugar`, `chlorides`, and `alcohol`; between `pH` and `fixed.acidity`; and between `fixed.acidity` and `free.sulfur.dioxide.rate`.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## free.sulfur.dioxide.rate -0.13909280 -0.19553198 0.016383115
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## free.sulfur.dioxide.rate 0.05196231 -0.03363087 0.7386123078
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## free.sulfur.dioxide.rate -0.012930593 -0.06535628 0.0004666430
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
## free.sulfur.dioxide.rate -0.02261212 0.06489615 0.197649851
## free.sulfur.dioxide.rate
## fixed.acidity -0.139092802
## volatile.acidity -0.195531983
## citric.acid 0.016383115
## residual.sugar 0.051962312
## chlorides -0.033630873
## free.sulfur.dioxide 0.738612308
## total.sulfur.dioxide -0.012930593
## density -0.065356278
## pH 0.000466643
## sulphates -0.022612119
## alcohol 0.064896146
## quality 0.197649851
## free.sulfur.dioxide.rate 1.000000000
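A table like the one above can be produced with base R's `cor()`. The sketch below is an illustration under stated assumptions: it uses only four columns and the first ten rows from the `str()` output, so its coefficients will differ from the full-data values printed above, and `wine_sample` is a name introduced here.

```r
# Ten-row sample of four columns, copied from the str() output above.
wine_sample <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pairwise Pearson correlations; the result is a symmetric matrix
# with ones on the diagonal.
cor_matrix <- cor(wine_sample, method = "pearson")

print(round(cor_matrix, 3))
```

Even in this tiny sample, `residual.sugar` and `alcohol` are negatively correlated, in line with the sign seen in the full matrix.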
We can detect several linear correlations involving `density`; let's plot them.
`residual.sugar`, `chlorides` and `total.sulfur.dioxide` all increase as density increases, which is reasonable given that dissolved solids raise density, but `alcohol` shows the opposite trend.
Noticing the similar shapes of the `residual.sugar` and `alcohol` plots, I plotted a fifth graph to see whether they are correlated too. The result supports a negative correlation. This is reasonable: according to what I read online, sugar is converted to alcohol during fermentation, and alcohol is lighter than water, which brings down the overall density of the wine.
Scatter plots show the negative correlation between `pH` and `fixed.acidity`. This is understandable, since higher acidity means a lower pH value, but we can't see such a correlation between pH and the two other acidity attributes.
The bimodal shape implies that different categories may exist among the wines.
I found online that wines are usually classified into the following categories according to their residual sugar:
- dry (< 4 g/dm^3)
- semi-dry (4 ~ 12 g/dm^3)
- semi-sweet (12 ~ 45 g/dm^3)
- sweet (> 45 g/dm^3)
Let’s create a new variable, `category`, by cutting the `residual.sugar` values.
## residual.sugar category
## 1 20.7 semi-sweet
## 2 1.6 dry
## 3 6.9 semi-dry
## 4 8.5 semi-dry
## 5 8.5 semi-dry
##
## dry semi-dry semi-sweet sweet
## 2097 1975 825 0
Only one case is classified as ‘sweet’. I removed it, since it carries no meaning for the following analysis.
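The cut and the removal of the single sweet wine could look like the sketch below. It is an assumption-laden illustration rather than the original code: it uses the five values printed above plus the dataset maximum of 65.8 as the sweet case, and it treats each boundary as belonging to the sweeter category (`right = FALSE`), which the category list does not specify.

```r
# Five values from the table above, plus the maximum from summary().
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 65.8)

# Intervals are [0,4), [4,12), [12,45), [45,Inf) in g/dm^3.
category <- cut(
  residual_sugar,
  breaks = c(0, 4, 12, 45, Inf),
  labels = c("dry", "semi-dry", "semi-sweet", "sweet"),
  right  = FALSE
)
wines <- data.frame(residual.sugar = residual_sugar, category = category)

# Drop the sweet wine; the unused factor level stays and shows a count of 0.
wines <- subset(wines, category != "sweet")

print(wines)
print(table(wines$category))
```

Keeping the empty `sweet` level matches the frequency table above, where it appears with a count of 0; `droplevels()` would remove it if that is preferred.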
Evidently, dry and semi-dry wines are distributed on either side of the valley, and we even notice a third group for semi-sweet. That explains the bimodal shape.
As we can see in the matrix, no remarkable correlations are visible by eye, except perhaps for `alcohol`.
I ran a correlation test focusing on `quality` (as a numerical value instead of a categorical one), which supports the visual conclusion: only `alcohol` has an r value clearly greater than 0.3, while `density` is just past -0.3. Let’s plot both.
## # A tibble: 12 x 2
## rowname quality
## <chr> <dbl>
## 1 fixed.acidity -0.114
## 2 volatile.acidity -0.195
## 3 citric.acid -0.00921
## 4 residual.sugar -0.0976
## 5 chlorides -0.210
## 6 free.sulfur.dioxide 0.00816
## 7 total.sulfur.dioxide -0.175
## 8 density -0.307
## 9 pH 0.0994
## 10 sulphates 0.0537
## 11 alcohol 0.436
## 12 free.sulfur.dioxide.rate 0.198
As we can see, alcohol first decreases as the `quality` score rises, reaching a valley at a score of 5, and then grows steadily with higher scores.
We can also see that the response of `quality` to `density` is almost the reverse. This is reasonable, as we already know that `density` and `alcohol` are themselves correlated.
So which one really plays a role in affecting `quality`?
We set off hoping to figure out how the 12 attributes affect the quality scores, but I found only one attribute, alcohol, that clearly plays a role. So either other useful physical attributes are not included, or, we may guess, human tasting is not that reliable.
Dividing wines into 4 categories based on their residual sugar explained the bimodal shape we detected in the univariate analysis. The relationship between residual sugar, alcohol and density is also interesting.
The strongest relationship is between `residual.sugar` and `density`, with an r value as high as 0.839.
Since `sulphates` is described as “a wine additive which can contribute to sulfur dioxide gas (SO2) levels”, we expect to see some relationship between them.
We start by creating a new variable, `bound.sulfur.dioxide`, by subtracting `free.sulfur.dioxide` from `total.sulfur.dioxide`.
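The subtraction can be sketched as below, using the first ten rows from the `str()` output; the vector names are placeholders introduced for this illustration.

```r
# First ten rows of the two sulfur dioxide columns, from str() above.
free_so2  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total_so2 <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Bound sulfur dioxide is whatever part of the total is not free.
bound_so2 <- total_so2 - free_so2

# Sanity check: the bound portion can never be negative.
stopifnot(all(bound_so2 >= 0))

print(bound_so2)
```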
That is disappointing: the plot didn’t reveal a linear correlation between them.
We made another plot that separates free and bound sulfur dioxide and takes the log of sulphates. It shows only a mildly positive relationship between bound sulfur dioxide and sulphates.
We can see that both free and bound sulfur dioxide increase with residual sugar (category). Since sulfur dioxide prevents microbial growth and the oxidation of wine, it seems reasonable that wines with more sugar need more sulfur dioxide.
This plot shows a tendency toward lower total sulfur dioxide with higher quality. The change is mostly contributed by the bound portion; the free portion stays almost unchanged across quality levels.
I found the following information regarding sulfites in wine:
- Other factors that affect how much sulfite is needed are the residual sugar and the acidity of the wine. Dryer wines with more acid will tend to be lower in sulfites. Sweet wines and dessert wines, on the other hand, tend to be quite high in sulfites.
Let’s plot to see whether that’s true for our data:
We do see sweeter wines gathering around the higher end of sulfur dioxide, but we can’t see them gathering around the more acidic end (lower pH). The reason is unknown.
We found that sulfur dioxide is also affected by residual sugar, which strengthened the correlation we found between residual sugar and quality.
I tried to rediscover the relationship between sulfur dioxide, acid and residual sugar. The data did show the correlation between sulfur dioxide and sugar, but not between acid and sugar.
Residual sugar is not evenly distributed in this dataset; instead it gathers around three peaks that represent three categories: dry, semi-dry, and semi-sweet.
The density of wines is usually under 1 g/cm^3 (the density of water), and it is one of the outstanding features that correlate with a wine’s quality score. We can see that wines with high scores tend to have lower density (which also indicates lower residual sugar and higher alcohol). We can also notice a slightly bimodal density among higher-scoring wines, which reflects the bimodal/trimodal situation we revealed in Plot One.
Wines with higher quality scores tend to contain less total sulfur dioxide. The differences are mostly contributed by bound sulfur dioxide, while the free portion stays almost unchanged across scores.
I set out on this analysis expecting to find factors that impact white wine quality. It turned out that a series of mutually correlated features, such as residual sugar, alcohol, density and sulfur dioxide, influence the scores together. It’s hard to tell which one actually affects human taste the most.
During the analysis, we did rediscover some industry experience and physical rules about wines. For example, the more sugar is consumed during fermentation, the more alcohol is generated, which lowers the wine’s density. But I failed to reveal the relationships among sugar, sulfur and acid.
The whole dataset contains around 5,000 records but includes only 1 case that can be classified as a ‘sweet’ wine. With more records on sweet wines, we might be able to detect correlations that are not noticeable now. Comparing the white wine dataset with the red wine dataset might also help us find more interesting features of wines.